Detecting Annotation Errors in Spoken Language Corpora

Authors

  • Markus Dickinson
  • W. Detmar Meurers
Abstract

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other structural annotation (Dickinson and Meurers, 2003b, 2005; Ule and Simov, 2004; Dickinson, 2005). Spoken language differs in many respects from written language, but to the best of our knowledge the issue of detecting errors in the annotation of spoken language corpora has not yet been systematically addressed. This is significant since spoken data is increasingly relevant for linguistic and computational research—and such corpora are starting to become more readily available, as illustrated by the holdings of the Linguistic Data Consortium (http://www.ldc.upenn.edu). This paper addresses the issue, based on the variation n-gram error detection approach developed in Dickinson and Meurers (2003a). We use the German Verbmobil treebank (Hinrichs et al., 2000) as an exemplar of a spoken language corpus and discuss properties of such corpora which are relevant when adapting the variation n-gram approach for detecting errors in syntactic annotation of spoken language corpora.
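The core idea behind a variation n-gram search can be illustrated with a small sketch. It is not the authors' implementation, and it covers only positional (part-of-speech) annotation rather than the syntactic annotation discussed in the paper. It assumes a corpus represented as lists of (word, tag) pairs and uses a brute-force pass over all word n-grams. The function name `variation_ngrams` and the toy sentences and tags are invented for illustration.

```python
from collections import defaultdict


def variation_ngrams(sentences, n):
    """Find word n-grams whose identical word sequence carries different tags.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Returns {word n-gram: {position: set of tags}} restricted to the
    positions where more than one tag was observed.
    """
    # seen[words][position] collects every tag observed at that position
    seen = defaultdict(lambda: defaultdict(set))
    for sent in sentences:
        for start in range(len(sent) - n + 1):
            window = sent[start:start + n]
            words = tuple(word for word, _ in window)
            for pos, (_, tag) in enumerate(window):
                seen[words][pos].add(tag)

    result = {}
    for words, tags_by_pos in seen.items():
        # Keep only positions with conflicting tags
        varying = {pos: tags for pos, tags in tags_by_pos.items() if len(tags) > 1}
        if varying:
            result[words] = varying
    return result


# Toy data (hypothetical): "meet" is tagged inconsistently in identical context,
# while "can" is genuinely ambiguous between two different contexts.
toy_corpus = [
    [("we", "PRP"), ("can", "MD"), ("meet", "VB"), ("then", "RB")],
    [("we", "PRP"), ("can", "MD"), ("meet", "VBP"), ("then", "RB")],
    [("a", "DT"), ("can", "NN"), ("of", "IN"), ("soup", "NN")],
]

unigram_variation = variation_ngrams(toy_corpus, 1)
trigram_variation = variation_ngrams(toy_corpus, 3)
```

In this toy corpus, the unigram pass flags both "can" and "meet", whereas the trigram pass flags only the n-grams containing "meet". This reflects the heuristic that variation within a longer identical context is more likely to be an annotation error than a genuine ambiguity.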

Similar resources

Towards Detecting Annotation Errors in Spoken Language Corpora

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), only recently has there been some work in detecting errors in synt...

Transcribing Speech: Errors in Corpora and Experimental Settings

Administrations, government bodies, and courts have always faced the problem of defining limits in transcription practices. Nowadays, corpus linguistics and computational linguistics focus on spoken corpora as indispensable tools for descriptive linguistics, as well as for applied purposes (in speech technologies, such as text-to-speech and speech recognition, in dialogue...

DECCA Project Description

In the past decade, research and applications in human language technology have been strongly influenced by the success of data-driven and stochastic modeling of natural language based on electronic corpora annotated with linguistic information. Annotated corpora are fundamental for training and testing algorithms in statistical natural language processing, and they are essential as gold standa...

A multi-level multimedia concordancer for spoken language corpora (Un concordancier multi-niveaux et multimédia pour des corpus oraux) [in French]

Concordances have always played an important role in the analysis of language corpora, for studies in humanities, literature, linguistics, translation and language teaching. However, very few of the available systems support multi-level queries against a richly-annotated, sound-aligned spoken corpus. The rapid growth in the development of spoken corpora, particularly for French, increases the n...

Slips and errors in spoken data transcription

The present work illustrates the main results of an experiment on errors and repairs in spoken language transcription, with significant relevance for the evaluation of validity, reliability and correctness of transcriptions of speech belonging to several different typologies, set for the annotation of spoken corpora. In particular, we dealt with errors and repair strategies that appear on the f...

Publication date: 2005